Abstract

This paper focuses on the problem of kernelizing an existing supervised Mahalanobis
distance learner. The following features are included in the paper. Firstly, three popular
learners, namely, "neighborhood component analysis", "large margin nearest neighbors"
and "discriminant neighborhood embedding", which do not have kernel versions are
kernelized in order to improve their classification performances. Secondly, an alternative
kernelization framework called "KPCA trick" is presented. Implementing a learner in the new
framework gains several advantages over the standard framework, e.g. no mathematical
formulas and no reprogramming are required for a kernel implementation, the framework
avoids troublesome problems such as singularity, etc. Thirdly, while the truths of
representer theorems are just assumptions in previous papers related to ours, here, representer
theorems are formally proven. The proofs validate both the kernel trick and the KPCA
trick in the context of Mahalanobis distance learning. Fourthly, unlike previous works
which always apply brute force methods to select a kernel, we investigate two approaches
which can be efficiently adopted to construct an appropriate kernel for a given dataset.
Finally, numerical results on various real-world datasets are presented.

1. Introduction

Recently, many Mahalanobis distance learners have been invented (Chen et al., 2005; Goldberger
et al., 2005; Weinberger et al., 2006; Yang et al., 2006; Sugiyama, 2006; Yan et al., 2007;
Zhang et al., 2007; Torresani & Lee, 2007; Xing et al., 2003). These recently proposed
learners are carefully designed so that they can handle a class of problems in which the data
of one class are multi-modal, a setting that classical learners such as principal component
analysis (PCA) and Fisher discriminant analysis (FDA) cannot handle. Consequently, the
new learners usually outperform the classical learners in the experiments reported in recent
papers. Nevertheless, since learning a Mahalanobis distance is equivalent to learning a
linear map, the inability to learn a non-linear transformation is one important limitation of
all Mahalanobis distance learners.

As research in Mahalanobis distance learning has only recently begun, several issues
are left open: (1) some efficient learners do not have non-linear extensions; (2)
the kernel trick (Scholkopf & Smola, 2001), a standard non-linearization method, is not
fully automatic in the sense that new mathematical formulas have to be derived and new
code has to be implemented, which is inconvenient for non-experts; (3)
existing algorithms "assume" the truth of the representer theorem (Scholkopf & Smola,
2001, Chapter 4), but, to our knowledge, there is no formal proof of the theorem in the
context of Mahalanobis distance learning; and (4) the problem of how to select an efficient
kernel function has been left untouched in previous works; currently, the best kernel is
obtained via a brute-force method such as cross validation.

In this paper, we highlight the following key contributions:

• Three popular learners recently proposed in the literature, namely, neighborhood
component analysis (NCA) (Goldberger et al., 2005), large margin nearest neighbors (LMNN)
(Weinberger et al., 2006) and discriminant neighborhood embedding (DNE) (Zhang et al.,
2007), are kernelized in order to improve their classification performances with respect to
the kNN algorithm.

• A KPCA trick framework is presented as an alternative to the kernel-trick
framework. In contrast to the kernel trick, the KPCA trick does not require users to derive
new mathematical formulas. Also, whenever an implementation of an original learner is
available, users are not required to re-implement the kernel version of the original learner.
Moreover, the new framework avoids problems such as singularity in eigen-decomposition
and provides a convenient way to speed up a learner.

• Two representer theorems in the context of Mahalanobis distance learning are proven.
Our theorems justify both the kernel-trick and the KPCA-trick frameworks. Moreover, the
theorems validate kernelized algorithms learning a Mahalanobis distance in any separable
Hilbert space and also cover kernelized algorithms performing dimensionality reduction.

• The problem of efficient kernel selection is dealt with. Firstly, we investigate the kernel
alignment method proposed in previous works (Lanckriet et al., 2004; Zhu et al., 2005) to
see whether it is appropriate for a kernelized Mahalanobis distance learner or not. Secondly,
we investigate a simple method which constructs an unweighted combination of base kernels.
A theoretical result is provided to support this simple approach. Kernel constructions based
on our two approaches require much shorter running time than the standard
cross validation approach.

• As kNN is already a non-linear classifier, there are some doubts about the usefulness
of kernelizing Mahalanobis distance learners (Weinberger et al., 2006, p. 8). We
provide an explanation and conduct extensive experiments on real-world datasets to
demonstrate the usefulness of the kernelization.



2. Background 

Let $\{x_i, y_i\}_{i=1}^n$ denote a training set of n labeled examples with inputs $x_i \in \mathbb{R}^D$ and
corresponding class labels $y_i \in \{c_1, ..., c_p\}$. Any Mahalanobis distance can be represented by
a symmetric positive semi-definite (PSD) matrix $M \in \mathbb{S}_+^D$. Here, we denote $\mathbb{S}_+^D$ as the space
of $D \times D$ PSD matrices. Given two points $x_i$ and $x_j$, and a PSD matrix M, the
Mahalanobis distance with respect to M between the two points is defined as
$\|x_i - x_j\|_M = \sqrt{(x_i - x_j)^T M (x_i - x_j)}$. Our goal is to find a PSD matrix $M^*$ that minimizes a reasonable
objective function $f(\cdot)$:

$$M^* = \arg\min_{M \in \mathbb{S}_+^D} f(M). \tag{1}$$

Since the PSD matrix M can be decomposed to $A^T A$, we can equivalently restate our
problem as learning the best matrix A:

$$A^* = \arg\min_{A \in \mathbb{R}^{d \times D}} f(A). \tag{2}$$

Note that d = D in the standard setting, but for the purpose of dimensionality reduction
we can learn a low-rank projection by restricting d < D. After learning the best linear map
$A^*$, it will be used by kNN to compute the distance between two points in the transformed
space as $(x_i - x_j)^T M^* (x_i - x_j) = \|A^* x_i - A^* x_j\|^2$.
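The equivalence between the two forms of the distance can be checked numerically. The following is a minimal sketch in plain Python; the function names and the toy matrix A are illustrative assumptions and do not come from the paper.

```python
def matvec(A, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def gram(A):
    """Return M = A^T A for a d x D matrix A."""
    d, D = len(A), len(A[0])
    return [[sum(A[k][i] * A[k][j] for k in range(d)) for j in range(D)]
            for i in range(D)]


def mahalanobis_sq_via_M(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for a PSD matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    return sum(a * b for a, b in zip(diff, matvec(M, diff)))


def mahalanobis_sq_via_A(x, y, A):
    """Squared Euclidean distance ||Ax - Ay||^2 after applying the linear map A."""
    diff = [a - b for a, b in zip(x, y)]
    return sum(t * t for t in matvec(A, diff))


A = [[1.0, 2.0], [0.0, 1.0]]      # hypothetical linear map with d = D = 2
x, y = [1.0, 0.0], [0.0, 1.0]
# Both calls print the same value because M = A^T A.
print(mahalanobis_sq_via_M(x, y, gram(A)), mahalanobis_sq_via_A(x, y, A))
```

In practice the second form is the one a kNN classifier would use, since the data can be mapped by A once and then compared with ordinary Euclidean distances.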

In the following subsections, we present three popular algorithms whose objective functions
are mainly designed for subsequent kNN classification. Despite their
efficiency and popularity, the three algorithms have no kernel versions, and thus
in this paper we are primarily interested in kernelizing these three algorithms in order to
improve their classification performances.

2.1 Neighborhood Component Analysis (NCA) Algorithm 

The original goal of NCA (Goldberger et al., 2005) is to optimize the leave-one-out (LOO) 
performance on training data. However, as the actual LOO classification error of kNN is a 
non-smooth function of the matrix A, Goldberger et al. propose to minimize a stochastic 
variant of the LOO kNN score which is defined as follows: 

$$f^{NCA}(A) = -\sum_{i} \sum_{j:\, y_j = y_i} p_{ij},$$

where

$$p_{ij} = \frac{\exp(-\|Ax_i - Ax_j\|^2)}{\sum_{k \neq i} \exp(-\|Ax_i - Ax_k\|^2)}, \qquad p_{ii} = 0.$$

Optimizing $f^{NCA}(\cdot)$ can be done by applying a gradient based method. One major
disadvantage of NCA, however, is that $f^{NCA}(\cdot)$ is not convex, and the gradient based methods
are thus prone to local optima.
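The stochastic leave-one-out score can be evaluated directly from its definition. The sketch below is a plain-Python illustration only: it assumes the objective is the negated sum of same-class soft-neighbor probabilities (so that smaller is better), and the function names are hypothetical rather than taken from any NCA implementation.

```python
import math


def transform(A, x):
    """Apply the linear map A (list of rows) to the point x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nca_objective(A, X, y):
    """Negative expected number of correctly classified points (to be minimized)."""
    Z = [transform(A, x) for x in X]
    n = len(Z)
    total = 0.0
    for i in range(n):
        # Unnormalized soft-neighbor weights; the weight of i to itself is zero.
        w = [0.0 if k == i else math.exp(-sq_dist(Z[i], Z[k])) for k in range(n)]
        denom = sum(w)
        # Probability mass that point i assigns to points of its own class.
        total += sum(w[j] for j in range(n) if y[j] == y[i]) / denom
    return -total


X = [[0.0], [0.1], [10.0], [10.1]]   # two well-separated toy classes
y = [0, 0, 1, 1]
print(nca_objective([[1.0]], X, y))  # close to -4: every point picks a same-class neighbor
print(nca_objective([[0.0]], X, y))  # all points collapse, so each probability is 1/3
```

The sketch makes no attempt at numerical stability: for widely separated data the denominator can underflow, which a practical implementation would have to guard against.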

2.2 Large Margin Nearest Neighbor (LMNN) Algorithm 

In LMNN (Weinberger et al., 2006), the output Mahalanobis distance is optimized with 
the goal that for each point, its k-nearest neighbors always belong to the same class while 
examples from different classes are separated by a large margin. 

For each point $x_i$, we define its k target neighbors as the k other inputs with the same
label $y_i$ that are closest to $x_i$ (with respect to the Euclidean distance in the input space).
We use $w_{ij} \in \{0, 1\}$ to indicate whether an input $x_j$ is a target neighbor of an input $x_i$. For
convenience, we define $y_{ij} \in \{0, 1\}$ to indicate whether or not the labels $y_i$ and $y_j$ match.
The objective function of LMNN is as follows:

$$f^{LMNN}(M) = \sum_{i,j} w_{ij} \|x_i - x_j\|_M^2 + c \sum_{i,j,l} w_{ij} (1 - y_{il}) \left[ 1 + \|x_i - x_j\|_M^2 - \|x_i - x_l\|_M^2 \right]_+ ,$$

where $[\cdot]_+$ denotes the standard hinge loss: $[z]_+ = \max(z, 0)$. The term $c > 0$ is a positive
constant typically set by cross validation. The objective function above is convex¹ and has
two competing terms. The first term penalizes large distances between each input and its
target neighbors, while the second term penalizes small distances between each input and
all other inputs that do not share the same label.

1. There is a variation on LMNN called "large margin component analysis" (LMCA) (Torresani & Lee,
2007) which proposes to optimize A instead of M; however, LMCA does not preserve some desirable
properties, such as convexity, of LMNN, and therefore the algorithm "Kernel LMCA" presented there is
different from "Kernel LMNN" presented in this paper.
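To make the two competing terms of the LMNN objective concrete, the following plain-Python sketch evaluates them for a given PSD matrix M. It is an illustration under stated assumptions: the target neighbors are passed in explicitly as index lists (the hypothetical `targets` argument, with `targets[i]` holding the indices j for which w_ij = 1), and the code only evaluates the objective; it does not perform the convex optimization.

```python
def sq_dist_M(x, z, M):
    """Squared Mahalanobis distance (x - z)^T M (x - z)."""
    diff = [a - b for a, b in zip(x, z)]
    Md = [sum(m * d for m, d in zip(row, diff)) for row in M]
    return sum(d * v for d, v in zip(diff, Md))


def lmnn_objective(M, X, y, targets, c=1.0):
    """Pull term plus c times the hinge-loss push term."""
    pull, push = 0.0, 0.0
    for i, neighbours in enumerate(targets):
        for j in neighbours:                      # pairs with w_ij = 1
            d_ij = sq_dist_M(X[i], X[j], M)
            pull += d_ij                          # penalize distant target neighbors
            for l in range(len(X)):
                if y[l] != y[i]:                  # only differently labeled points
                    # Hinge loss: active when x_l is within one unit of margin.
                    push += max(0.0, 1.0 + d_ij - sq_dist_M(X[i], X[l], M))
    return pull + c * push


X = [[0.0], [1.0], [1.5]]        # toy 1-D data
y = [0, 0, 1]
targets = [[1], [0], []]         # hypothetical target-neighbor lists (k = 1)
print(lmnn_objective([[1.0]], X, y, targets, c=1.0))
```

In this toy example only the second point incurs a margin violation, because the differently labeled point lies closer to it than its target neighbor plus the unit margin.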

2.3 Discriminant Neighborhood Embedding (DNE) Algorithm 

The main idea of DNE (Zhang et al., 2007) is quite similar to LMNN. DNE seeks a linear
transformation such that neighborhood points in the same class are squeezed but those in
different classes are separated as much as possible. However, DNE does not use the
notion of margin; in the case of LMNN, we want every point to stay far from points of other
classes, but for DNE, we want the average distance between two neighborhood points of
different classes to be large. Another difference is that LMNN can learn a full Mahalanobis
distance, i.e. a general weighted linear projection, while DNE can learn only an unweighted
linear projection.

Similar to LMNN, we define two sets of k target neighbors for each point $x_i$ based on
the Euclidean distance in the input space. For each $x_i$, let $Neig^I(i)$ be the set of k nearest
neighbors having the same label $y_i$, and let $Neig^E(i)$ be the set of k nearest neighbors
having different labels from $y_i$. We define $w_{ij}$ as follows:

$$w_{ij} = \begin{cases} +1, & \text{if } j \in Neig^I(i) \vee i \in Neig^I(j), \\ -1, & \text{if } j \in Neig^E(i) \vee i \in Neig^E(j), \\ 0, & \text{otherwise.} \end{cases}$$

The objective function of DNE is:

$$f^{DNE}(A) = \sum_{i,j} w_{ij} \|Ax_i - Ax_j\|^2,$$

which can be reformulated (up to a constant factor) to be

$$f^{DNE}(A) = \mathrm{trace}(A X (D - W) X^T A^T),$$

where W is a symmetric matrix with elements $w_{ij}$, D is a diagonal matrix with $D_{ii} = \sum_j w_{ij}$
and X is the matrix of input points $(x_1, ..., x_n)$. It is a well-known result from spectral graph
theory (von Luxburg, 2007) that $D - W$ is symmetric but is not necessarily PSD. To solve the
problem by eigen-decomposition, the constraint $AA^T = I$ is added (recall that $A \in \mathbb{R}^{d \times D}$,
$d \le D$) so that we have the following optimization problem:

$$A^* = \arg\min_{AA^T = I} \mathrm{trace}(A X (D - W) X^T A^T). \tag{4}$$
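The reformulation of the DNE objective as a trace can be verified numerically. In the plain-Python sketch below (function names and toy data are illustrative assumptions), the pairwise form equals twice the trace form, which is the constant factor mentioned in the text; the trace is computed as the sum of (D - W)_ij times the inner product of the mapped points, which is the same quantity.

```python
def transform(A, x):
    """Apply the linear map A (list of rows) to the point x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dne_pairwise(A, X, W):
    """sum_{i,j} w_ij ||A x_i - A x_j||^2."""
    Z = [transform(A, x) for x in X]
    n = len(Z)
    return sum(W[i][j] * sum((a - b) ** 2 for a, b in zip(Z[i], Z[j]))
               for i in range(n) for j in range(n))


def dne_trace(A, X, W):
    """trace(A X (D - W) X^T A^T) with D_ii = sum_j w_ij."""
    Z = [transform(A, x) for x in X]
    n = len(Z)
    degree = [sum(row) for row in W]
    total = 0.0
    for i in range(n):
        for j in range(n):
            laplacian_ij = (degree[i] if i == j else 0.0) - W[i][j]
            total += laplacian_ij * dot(Z[i], Z[j])
    return total


X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]      # toy inputs, one per row
W = [[0, 1, -1], [1, 0, -1], [-1, -1, 0]]     # hypothetical symmetric weights
A = [[1.0, 1.0]]                              # hypothetical 1 x 2 projection
print(dne_pairwise(A, X, W), 2 * dne_trace(A, X, W))
```

Because the constant factor does not change the minimizer, Eq. (4) can then be solved by taking the eigenvectors of the symmetric matrix $X(D - W)X^T$ associated with its smallest eigenvalues, which is the eigen-decomposition the text refers to.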

3. Kernelization 

In this section, we focus on two kernelization frameworks for non-linearizing the three
algorithms presented in the previous section. First, the standard kernel-trick framework
is presented. Next, the KPCA-trick framework, an alternative to the
kernel-trick framework, is presented. Kernelization in this new framework is conveniently done
with an application of kernel principal component analysis (KPCA). Finally, representer
theorems are proven to validate all applications of the two kernelization frameworks in the
context of Mahalanobis distance learning. Note that in some previous works (Chen et al.,
2005; Globerson & Roweis, 2006; Yan et al., 2007; Torresani & Lee, 2007), the validity of
applications of the kernel trick has not been proven.

3.1 Historical Background 

After finishing the current paper, we learned that the name "KPCA trick" first appeared
in the paper of Chapelle and Scholkopf (2001), who first applied this method to invariant
support vector machines; moreover, it appears that the KPCA trick has been
known to some researchers (private communication with some ECML reviewers). Without
knowing this, we reinvented the framework and, coincidentally, called it "KPCA
trick" ourselves. Nevertheless, we will show in Section 5.1 that, in the context of
Mahalanobis distance learning, the KPCA trick has many non-trivial advantages over the kernel
trick; we believe that this result is new and is not a consequence of previous works.
Also, the mathematical tools provided in previous works (Scholkopf & Smola, 2001; Chapelle
& Scholkopf, 2001) are not enough to prove the validity of the KPCA trick in this context,
and thus a new validation proof of the KPCA trick is needed (see our Theorem 1).

3.2 The Kernel-Trick Framework

Given a PSD kernel function $k(\cdot, \cdot)$ (Scholkopf & Smola, 2001), we denote $\phi$, $\phi'$ and $\phi_i$ as
mapped data (in a feature space associated with the kernel) of each example x, x' and $x_i$,
respectively. A (squared) Mahalanobis distance under a matrix M in the feature space is

$$(\phi_i - \phi_j)^T M (\phi_i - \phi_j) = (\phi_i - \phi_j)^T A^T A (\phi_i - \phi_j). \tag{5}$$

To be consistent with Subsection 2.3, let $A^T = (a_1, ..., a_d)$. Denote a (possibly
infinite-dimensional) matrix of the mapped training data $\Phi = (\phi_1, ..., \phi_n)$. The main idea of the
kernel-trick framework is to parameterize (see representer theorems below)

$$A^T = \Phi U^T, \tag{6}$$

where $U^T = (u_1, ..., u_d)$. Substituting A in Eq. (5) by using Eq. (6), we have

$$(\phi_i - \phi_j)^T M (\phi_i - \phi_j) = (k_i - k_j)^T U^T U (k_i - k_j),$$

where

$$k_i = \Phi^T \phi_i = (\langle \phi_1, \phi_i \rangle, ..., \langle \phi_n, \phi_i \rangle)^T. \tag{7}$$

Now our formula depends only on an inner-product $\langle \phi_i, \phi_j \rangle$, and thus the kernel trick can
now be applied by using the fact that $k(x_i, x_j) = \langle \phi_i, \phi_j \rangle$ for a PSD kernel function $k(\cdot, \cdot)$.
Therefore, the problem of finding the best Mahalanobis distance in the feature space is now
reduced to finding the best linear transformation U of size $d \times n$. Nonetheless, it often
happens that finding U is much more troublesome than finding A in the input space, even
though their optimization problems look similar, as shown in Section 5.1.

Once we find the matrix U, the Mahalanobis distance from a new test point x' to any
input point $x_i$ in the feature space can be calculated as follows:

$$\|\phi' - \phi_i\|_M^2 = (k' - k_i)^T U^T U (k' - k_i), \tag{8}$$

where $k' = (k(x', x_1), ..., k(x', x_n))^T$. kNN classification in the feature space can be
performed based on Eq. (8).
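The distance of Eq. (8) needs only kernel evaluations against the training points. The plain-Python sketch below illustrates this; the function names are hypothetical, and the matrix U is simply given rather than learned. With a linear kernel the result can be cross-checked against the explicit map A = U X, where X stacks the training points as rows.

```python
def kernel_vector(kernel, X_train, x):
    """k = (k(x, x_1), ..., k(x, x_n))^T for a point x."""
    return [kernel(x, xi) for xi in X_train]


def feature_space_sq_dist(U, kernel, X_train, x_new, i):
    """Squared distance (k' - k_i)^T U^T U (k' - k_i) between x_new and x_i."""
    k_new = kernel_vector(kernel, X_train, x_new)
    k_i = kernel_vector(kernel, X_train, X_train[i])
    diff = [a - b for a, b in zip(k_new, k_i)]
    projected = [sum(u * d for u, d in zip(row, diff)) for row in U]  # U (k' - k_i)
    return sum(p * p for p in projected)


def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


X_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy training inputs
U = [[1.0, 0.0, 2.0]]                            # hypothetical d x n matrix (d = 1, n = 3)
x_new = [2.0, 1.0]

# For the linear kernel, A = U X = [[3, 2]], so A (x_new - x_1) = 5 and the
# squared distance computed through kernel values alone agrees with it.
print(feature_space_sq_dist(U, linear_kernel, X_train, x_new, 0))
```

Any other PSD kernel function can be passed in place of `linear_kernel`; the explicit cross-check is then no longer available, which is precisely the situation the kernel trick is meant for.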

3.3 The KPCA-Trick Framework 

As we emphasized above, although the kernel-trick framework can be applied to non-linearize
the three learners introduced in Section 2, it often happens that finding U is much more
troublesome than finding A in the input space, even though their optimization problems look
similar (see Section 5.1). In this section, we develop a KPCA-trick framework which can be much
more conveniently applied to kernelize the three learners.

Denote $k(\cdot, \cdot)$, $\phi_i$, $\phi$ and $\phi'$ as in Subsection 3.2. The central idea of the KPCA trick is to
represent each $\phi_i$ and $\phi'$ in a new "finite"-dimensional space, without any loss of information.
Within the framework, a new coordinate of each example is computed "explicitly", and each
example in the new coordinate is then used as the input of any existing Mahalanobis distance
learner. As a result, by using the KPCA trick in place of the kernel trick, there is no need
to derive new mathematical formulas and no need to implement new algorithms.

To simplify the discussion of KPCA, we assume that $\{\phi_i\}$ is linearly independent and
has its center at the origin, i.e. $\sum_i \phi_i = 0$ (otherwise, $\{\phi_i\}$ can be centered by a simple
pre-processing step (Shawe-Taylor & Cristianini, 2004, p. 115)). Since we have n total
examples, the span of $\{\phi_i\}$ has dimensionality n. Here we claim that each example $\phi_i$
can be represented as $\varphi_i \in \mathbb{R}^n$ with respect to a new orthonormal basis $\{\psi_i\}_{i=1}^n$ such that
$\mathrm{span}(\{\psi_i\}_{i=1}^n)$ is the same as $\mathrm{span}(\{\phi_i\}_{i=1}^n)$ without loss of any information. More precisely,
we define

$$\varphi_i = (\langle \phi_i, \psi_1 \rangle, ..., \langle \phi_i, \psi_n \rangle)^T = \Psi^T \phi_i, \tag{9}$$

where $\Psi = (\psi_1, ..., \psi_n)$. Note that although we may be unable to numerically represent
each $\psi_i$, an inner-product $\langle \phi_i, \psi_j \rangle$ can be conveniently computed by KPCA (or kernel
Gram-Schmidt (Shawe-Taylor & Cristianini, 2004)). Likewise, a new test point $\phi'$ can be
mapped to $\varphi' = \Psi^T \phi'$. Consequently, the mapped data $\{\varphi_i\}$ and $\varphi'$ are finite-dimensional
and can be explicitly computed.
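The explicit coordinates described above can be sketched in a few lines. Note the substitution: instead of the eigen-decomposition used by KPCA, the sketch below uses a Cholesky factorization of the Gram matrix, which corresponds to the kernel Gram-Schmidt alternative mentioned in the text. Both yield coordinates with respect to an orthonormal basis of the same span, so they differ only by an orthogonal transformation. The sketch assumes a positive definite Gram matrix (linearly independent mapped points), omits the centering step, and uses illustrative function names and a toy RBF kernel that are not part of the paper.

```python
import math


def cholesky(K):
    """Lower-triangular L with K = L L^T (K must be positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def map_new_point(L, k_new):
    """Solve L z = k_new by forward substitution.

    z holds the coordinates of the projection of the new point onto the
    span of the mapped training points."""
    z = []
    for i in range(len(k_new)):
        z.append((k_new[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z


def rbf(x, z, gamma=0.5):
    """Gaussian RBF kernel; gamma is an arbitrary illustrative value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]     # toy training inputs
K = [[rbf(a, b) for b in X] for a in X]       # uncentered Gram matrix
L = cholesky(K)                               # row i of L is the finite vector for x_i

x_new = [0.5, 0.5]
phi_new = map_new_point(L, [rbf(x_new, xi) for xi in X])
print(L[1], phi_new)
```

The rows of `L` reproduce the Gram matrix through ordinary dot products, so they can be handed unchanged to any existing Mahalanobis distance learner, which is the point of the framework.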

3.3.1 The KPCA-trick Algorithm 

The KPCA-trick algorithm, consisting of three simple steps, is shown in Figure 1. NCA,
LMNN, DNE and other learners, including those in other settings (e.g. semi-supervised
settings), whose kernel versions are previously unknown (Yang et al., 2006; Xing et al.,
2003; Chatpatanasiri & Kijsirikul, 2008) can all be kernelized by this simple algorithm.
Therefore, it is much more convenient to kernelize a learner by applying the KPCA-trick
framework rather than applying the kernel-trick framework. In the algorithm, we denote a
Mahalanobis distance learner by maha which performs the optimization process shown in
Eq. (1) (or Eq. (2)) and outputs the best Mahalanobis distance $M^*$ (or the best linear map
$A^*$).

Input: 1. training examples: $\{(x_1, y_1), ..., (x_n, y_n)\}$,
       2. new example: x',
       3. kernel function: $k(\cdot, \cdot)$,
       4. Mahalanobis distance learning algorithm: maha
Algorithm:
(1) Apply kpca($k$, $\{x_i\}$, x') such that $\{x_i\} \mapsto \{\varphi_i\}$ and $x' \mapsto \varphi'$.
(2) Apply maha with new inputs $\{(\varphi_i, y_i)\}$ to achieve $M^*$ or $A^*$.
(3) Perform kNN based on the distance $\|\varphi_i - \varphi'\|_{M^*}$ or $\|A^*\varphi_i - A^*\varphi'\|$.

Figure 1: The KPCA-trick algorithm.

3.3.2 Representer Theorems

Is it valid to represent an infinite-dimensional vector $\phi$ by a finite-dimensional vector $\varphi$?
In the context of SVMs (Chapelle & Scholkopf, 2001), this validity of the KPCA trick is

easily achieved by straightforwardly extending a proof of an established representer theorem
(Scholkopf et al., 2001)². In the context of Mahalanobis distance learning, however, proofs
provided in previous works cannot be directly extended. Note that, in the SVM cases
considered in previous works, what is learned is a hyperplane, a linear functional outputting
a 1-dimensional value. In our case, as shown in Eq. (6), what is learned is a linear map which,
in general, outputs a countably infinite dimensional vector. Hence, to prove the validity of
the KPCA trick in our case, we need some mathematical tools which can handle a countably
infinite dimensionality. Below we give our versions of representer theorems which prove the
validity of the KPCA trick in the current context.

By our representer theorems, given an objective function $f(\cdot)$ (see
Eq. (1)), the optimal value of $f(\cdot)$ based on the input $\{\phi_i\}$ is equal to the optimal value of
$f(\cdot)$ based on the input $\{\varphi_i\}$. Hence, the representation $\varphi_i$ can be safely applied. We
separate the problem of Mahalanobis distance learning into two different cases. The first
theorem covers Mahalanobis distance learners (learning a full-rank linear transformation)
while the second theorem covers dimensionality reduction algorithms (learning a low-rank
linear transformation).

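Theorem 1. (Full-Rank Representer Theorem) Let $\{\tilde\psi_i\}_{i=1}^n$ be a set of points in a feature
space $\mathcal{X}$ such that $\mathrm{span}(\{\tilde\psi_i\}_{i=1}^n) = \mathrm{span}(\{\phi_i\}_{i=1}^n)$, and $\mathcal{X}$ and $\mathcal{Y}$ be separable Hilbert spaces.
For an objective function f depending only on $\{\langle A\phi_i, A\phi_j \rangle\}$, the optimization

$$\min_A : f(\langle A\phi_1, A\phi_1 \rangle, ..., \langle A\phi_i, A\phi_j \rangle, ..., \langle A\phi_n, A\phi_n \rangle)$$
$$\text{s.t. } A : \mathcal{X} \to \mathcal{Y} \text{ is a bounded linear map},$$

has the same optimal value as

$$\min_{A' \in \mathbb{R}^{n \times n}} f(\tilde\varphi_1^T A'^T A' \tilde\varphi_1, ..., \tilde\varphi_i^T A'^T A' \tilde\varphi_j, ..., \tilde\varphi_n^T A'^T A' \tilde\varphi_n),$$

where $\tilde\varphi_i = (\langle \phi_i, \tilde\psi_1 \rangle, ..., \langle \phi_i, \tilde\psi_n \rangle)^T \in \mathbb{R}^n$.

2. A representer theorem, along with Mercer theorem, is a key ingredient for validating the kernel trick
(Scholkopf & Smola, 2001). The origin of the classical representer theorem dates back to at least
the 1970s (Kimeldorf & Wahba, 1971).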
To our knowledge, mathematical tools provided in the work of Scholkopf and Smola
(2001, Chapter 4) are not enough to prove Theorem 1. The proof presented here is a
non-straightforward extension of Scholkopf and Smola's work. We also note that Theorem 1, as
well as Theorem 2 shown below, is more general than what we discuss above. They justify
both the kernel trick (by substituting $\tilde\psi_i = \phi_i$ and hence $\tilde\varphi_i = k_i$) and the KPCA trick (by
substituting $\tilde\psi_i = \psi_i$ and hence $\tilde\varphi_i = \varphi_i$).

To start proving Theorem 1, the following lemma is useful. 

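Lemma 1. Let $\mathcal{X}, \mathcal{Y}$ be two Hilbert spaces where $\mathcal{Y}$ is separable, i.e. $\mathcal{Y}$ has a countable
orthonormal basis $\{e_i\}_{i \in \mathbb{N}}$. Any bounded linear map $A : \mathcal{X} \to \mathcal{Y}$ can be uniquely decomposed
as $\sum_{i=1}^{\infty} \langle \cdot, \tau_i \rangle_{\mathcal{X}} e_i$ for some $\{\tau_i\}_{i \in \mathbb{N}} \subset \mathcal{X}$.

Proof. As A is bounded, the linear functional $\phi \mapsto \langle A\phi, e_i \rangle_{\mathcal{Y}}$ is bounded for every i since,
by Cauchy-Schwarz inequality, $|\langle A\phi, e_i \rangle_{\mathcal{Y}}| \le \|A\phi\| \|e_i\| \le \|A\| \|\phi\|$. By Riesz representation
theorem, the map $\langle A\cdot, e_i \rangle_{\mathcal{Y}}$ can be written as $\langle \cdot, \tau_i \rangle_{\mathcal{X}}$ for a unique $\tau_i \in \mathcal{X}$. Since $\{e_i\}_{i \in \mathbb{N}}$ is
an orthonormal basis of $\mathcal{Y}$, for every $\phi \in \mathcal{X}$, $A\phi = \sum_{i=1}^{\infty} \langle A\phi, e_i \rangle_{\mathcal{Y}} e_i = \sum_{i=1}^{\infty} \langle \phi, \tau_i \rangle_{\mathcal{X}} e_i$. □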

Proof. (Theorem 1) To avoid complicated notations, we omit subscripts such as $\mathcal{X}, \mathcal{Y}$ of
inner products. The proof consists of two steps. In the first step, we prove the
theorem by assuming that $\{\tilde\psi_i\}_{i=1}^n$ is an orthonormal set. In the second step, we prove the
theorem in the general case where $\{\tilde\psi_i\}_{i=1}^n$ is not necessarily orthonormal. The proof of the
first step requires an application of Fubini theorem (Lewkeeratiyutkul, 2006).

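Step 1. Assume that $\{\tilde\psi_i\}_{i=1}^n$ is an orthonormal set; in this step we write it as $\{\psi_i\}_{i=1}^n$ and
denote by $\varphi$ and $\varphi'$ the coordinate vectors of $\phi$ and $\phi'$ with respect to it. Let $\{e_i\}_{i=1}^{\infty}$ be an orthonormal basis
of $\mathcal{Y}$. For any $\phi' \in \mathcal{X}$, we have, by Lemma 1, $A\phi' = \sum_{k=1}^{\infty} \langle \phi', \tau_k \rangle e_k$. Hence, for each bounded
linear map $A : \mathcal{X} \to \mathcal{Y}$, and $\phi, \phi' \in \mathrm{span}(\{\psi_i\}_{i=1}^n)$, we have $\langle A\phi, A\phi' \rangle = \sum_{k=1}^{\infty} \langle \phi, \tau_k \rangle \langle \phi', \tau_k \rangle$.

Note that each $\tau_k$ can be decomposed as $\tau'_k + \tau_k^{\perp}$ such that $\tau'_k$ lies in $\mathrm{span}(\{\psi_i\}_{i=1}^n)$ and
$\tau_k^{\perp}$ is orthogonal to the span. These facts make $\langle \phi', \tau_k \rangle = \langle \phi', \tau'_k \rangle$ for every k. Moreover,
$\tau'_k = \sum_{j=1}^{n} u_{kj} \psi_j$ for some $\{u_{k1}, ..., u_{kn}\} \subset \mathbb{R}$. Hence, we have

$$\begin{aligned}
\langle A\phi, A\phi' \rangle &= \sum_{k=1}^{\infty} \langle \phi, \tau_k \rangle \langle \phi', \tau_k \rangle = \sum_{k=1}^{\infty} \langle \phi, \tau'_k \rangle \langle \phi', \tau'_k \rangle \\
&= \sum_{k=1}^{\infty} \Big\langle \phi, \sum_{i=1}^{n} u_{ki} \psi_i \Big\rangle \Big\langle \phi', \sum_{j=1}^{n} u_{kj} \psi_j \Big\rangle \\
&= \sum_{k=1}^{\infty} \sum_{i,j=1}^{n} u_{ki} u_{kj} \langle \phi, \psi_i \rangle \langle \phi', \psi_j \rangle \\
\text{(Fubini theorem: explained below)} \quad &= \sum_{i,j=1}^{n} \Big( \sum_{k=1}^{\infty} u_{ki} u_{kj} \Big) \langle \phi, \psi_i \rangle \langle \phi', \psi_j \rangle \\
&= \sum_{i,j=1}^{n} G_{ij} \langle \phi, \psi_i \rangle \langle \phi', \psi_j \rangle \\
&= \varphi^T G \varphi' = \varphi^T A'^T A' \varphi'.
\end{aligned}$$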

At the equality marked "Fubini theorem", we swap the two summations. To see that
Fubini theorem can be applied there, we first note that $\sum_{k=1}^{\infty} u_{ki}^2$ is finite
for each $i \in \{1, ..., n\}$ since

$$\sum_{k=1}^{\infty} u_{ki}^2 = \sum_{k=1}^{\infty} \Big\langle \psi_i, \sum_{j=1}^{n} u_{kj} \psi_j \Big\rangle \Big\langle \psi_i, \sum_{j=1}^{n} u_{kj} \psi_j \Big\rangle = \|A\psi_i\|^2 < \infty.$$

Applying the above result together with Cauchy-Schwarz inequality and Fubini theorem for
non-negative summation, we have

$$\begin{aligned}
\sum_{k=1}^{\infty} \sum_{i,j=1}^{n} |u_{ki} u_{kj} \langle \phi, \psi_i \rangle \langle \phi', \psi_j \rangle| &= \sum_{i,j=1}^{n} \sum_{k=1}^{\infty} |u_{ki} u_{kj} \langle \phi, \psi_i \rangle \langle \phi', \psi_j \rangle| \\
&= \sum_{i,j=1}^{n} |\langle \phi, \psi_i \rangle \langle \phi', \psi_j \rangle| \sum_{k=1}^{\infty} |u_{ki} u_{kj}| \\
&\le \sum_{i,j=1}^{n} |\langle \phi, \psi_i \rangle \langle \phi', \psi_j \rangle| \sqrt{\Big( \sum_{k=1}^{\infty} u_{ki}^2 \Big) \Big( \sum_{k=1}^{\infty} u_{kj}^2 \Big)} \\
&< \infty.
\end{aligned}$$

Hence, the summation converges absolutely and thus Fubini theorem can be applied as
claimed above. Again, using the fact that $\sum_{k=1}^{\infty} u_{ki}^2 < \infty$, we have that each element of G,
$G_{ij} = \sum_{k=1}^{\infty} u_{ki} u_{kj}$, is finite. Furthermore, the matrix G is PSD since each of its elements
can be regarded as an inner product of two vectors in $\ell_2$.

Hence, we finally have that $\langle A\phi_i, A\phi_j \rangle = \varphi_i^T A'^T A' \varphi_j$, for each $1 \le i, j \le n$. Hence,
whenever a map A is given, we can construct A' such that it results in the same objective
function value. By reversing the proof, it is easy to see that the converse is also true. The
first step of the proof is finished.

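Step 2. We now prove the theorem without assuming that $\{\tilde\psi_i\}_{i=1}^n$ is an orthonormal
set. Let all notations be the same as in Step 1. Let $\tilde\Psi$ be the matrix $(\tilde\psi_1, ..., \tilde\psi_n)$. Define
$\{\psi_i\}_{i=1}^n$ as an orthonormal set such that $\mathrm{span}(\{\tilde\psi_i\}_{i=1}^n) = \mathrm{span}(\{\psi_i\}_{i=1}^n)$, $\Psi = (\psi_1, ..., \psi_n)$
and $\varphi_i = \Psi^T \phi_i$. Then, we have that $\psi_i = \tilde\Psi c_i$ for some $c_i \in \mathbb{R}^n$ and $\Psi = \tilde\Psi C$ where
$C = (c_1, ..., c_n)$. Moreover, since C maps an independent set $\{\tilde\psi_i\}$ to another independent
set $\{\psi_i\}$, C is invertible. We then have, for any A',

$$\begin{aligned}
\varphi_i^T A'^T A' \varphi_j &= \phi_i^T \Psi A'^T A' \Psi^T \phi_j \\
&= \phi_i^T \tilde\Psi C A'^T A' C^T \tilde\Psi^T \phi_j \\
&= \tilde\varphi_i^T C A'^T A' C^T \tilde\varphi_j \\
&= \tilde\varphi_i^T B^T B \tilde\varphi_j,
\end{aligned}$$

where $A' C^T = B$ and $A' = B (C^T)^{-1}$. Hence, for any B we have the matrix A' which gives
$\varphi_i^T A'^T A' \varphi_j = \tilde\varphi_i^T B^T B \tilde\varphi_j$. Using the same arguments as in Step 1, we finish the proof of
Step 2 and of Theorem 1. □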

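Theorem 2. (Low-Rank Representer Theorem) Define $\{\tilde\psi_i\}_{i=1}^n$ and $\tilde\varphi_i$ as in Theorem 1.
For an objective function f depending only on $\{\langle A\phi_i, A\phi_j \rangle\}$, the optimization

$$\min_A : f(\langle A\phi_1, A\phi_1 \rangle, ..., \langle A\phi_i, A\phi_j \rangle, ..., \langle A\phi_n, A\phi_n \rangle)$$
$$\text{s.t. } A : \mathcal{X} \to \mathbb{R}^d \text{ is a bounded linear map},$$

has the same optimal value as

$$\min_{A' \in \mathbb{R}^{d \times n}} f(\tilde\varphi_1^T A'^T A' \tilde\varphi_1, ..., \tilde\varphi_i^T A'^T A' \tilde\varphi_j, ..., \tilde\varphi_n^T A'^T A' \tilde\varphi_n).$$

The proof of Theorem 2 is a generalization of the proofs in previous works (Scholkopf
& Smola, 2001, Chap. 4).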

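Proof. (Theorem 2) Let $\{e_i\}_{i=1}^d$ be the canonical basis of $\mathbb{R}^d$, and let $\phi \in \mathrm{span}\{\tilde\psi_1, ..., \tilde\psi_n\}$.
By Lemma 1, $A\phi = \sum_{i=1}^{d} \langle \phi, \tau_i \rangle e_i$ for some $\tau_1, ..., \tau_d \in \mathcal{X}$. Each $\tau_i$ can be decomposed as
$\tau'_i + \tau_i^{\perp}$ such that $\tau'_i$ lies in $\mathrm{span}\{\tilde\psi_1, ..., \tilde\psi_n\}$ and $\tau_i^{\perp}$ is orthogonal to the span. These facts
make $\langle \phi, \tau_i \rangle = \langle \phi, \tau'_i \rangle$ for every i. We then have, for some $u_{ij} \in \mathbb{R}$, $1 \le i \le d$, $1 \le j \le n$,

$$\begin{aligned}
A\phi &= \sum_{i=1}^{d} \Big\langle \phi, \sum_{j=1}^{n} u_{ij} \tilde\psi_j \Big\rangle e_i = \sum_{i=1}^{d} \sum_{j=1}^{n} u_{ij} \langle \phi, \tilde\psi_j \rangle e_i \\
&= \begin{pmatrix} u_{11} & \cdots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{d1} & \cdots & u_{dn} \end{pmatrix} \begin{pmatrix} \langle \phi, \tilde\psi_1 \rangle \\ \vdots \\ \langle \phi, \tilde\psi_n \rangle \end{pmatrix} = U \tilde\varphi.
\end{aligned}$$

Since every $\phi_i$ is in the span, we conclude that $A\phi_i = U\tilde\varphi_i$. Now, one can easily check that
$\langle A\phi_i, A\phi_j \rangle = \tilde\varphi_i^T U^T U \tilde\varphi_j$. Hence, whenever a map A is given, we can construct U such that
it results in the same objective function value. By reversing the proof, it is easy to see that
the converse is also true, and thus the theorem is proven (by renaming U to A'). □

Note that the proof of Theorem 2 cannot be directly used for proving Theorem 1 (let
$d = \infty$ and $U \in \mathbb{R}^{\infty \times n}$, and Theorem 2 is still valid. However, to be practically useful, we
need a finite-dimensional linear map. Hence, we must show that $U^T U \in \mathbb{R}^{n \times n}$ by proving
that $(U^T U)_{ij} < \infty$ for all i, j).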



3.3.3 Remarks 

1. Note that by Mercer theorem (Scholkopf & Smola, 2001, p. 37), we can either think of
each $\phi_i \in \ell_2$ or $\phi_i \in \mathbb{R}^N$ for some positive integer N, and thus the assumption of Theorem 1
that $\mathcal{X}$, as well as $\mathcal{Y}$, is a separable Hilbert space is then valid. Also, both theorems require
that the objective function of a learning algorithm must depend only on $\{\langle A\phi_i, A\phi_j \rangle\}_{i,j=1}^n$
or equivalently $\{\langle \phi_i, M\phi_j \rangle\}_{i,j=1}^n$. This condition is, actually, not a strict condition since
learners in the literature have their objective functions in this form (Chen et al., 2005;
Goldberger et al., 2005; Globerson & Roweis, 2006; Weinberger et al., 2006; Yang et al., 2006;
Sugiyama, 2006; Yan et al., 2007; Zhang et al., 2007; Torresani & Lee, 2007).

2. Note that the two theorems stated in this section do not require $\{\tilde\psi_i\}$ to be an
orthonormal set. However, there is an advantage of the KPCA trick which restricts $\tilde\psi_i = \psi_i$
as in Eq. (9); this will be discussed in Sect. 5.1.



3. The running time of each learner strongly depends on the dimensionality of the input
data. As recommended by Weinberger et al. (2006), it can be helpful to first apply a
dimensionality reduction algorithm such as PCA before performing a learning process: the
learning process can be tremendously sped up by retaining only, say, the 200 largest-variance
principal components of the input data. In the KPCA trick framework illustrated
in Figure 1, dimensionality reduction can be performed without any extra work as KPCA
is already applied in the first place.

4. A stronger version of Theorem 2 can be achieved by inserting a regularizer into the
objective function of a (kernelized) Mahalanobis distance learner as stated in Theorem 3.
For compact notations, we use the fact that A is representable by $\{\tau_i\}$ as shown in Lemma 1.

Theorem 3. (Strong Representer Theorem) Define $\{\tilde\psi_i\}_{i=1}^n$ as in Theorem 1. For
monotonically increasing functions $g_i$, let

$$h(\tau_1, ..., \tau_d, \phi_1, ..., \phi_n) = f(\langle \tau_1, \phi_1 \rangle, ..., \langle \tau_i, \phi_j \rangle, ..., \langle \tau_d, \phi_n \rangle) + \sum_{i=1}^{d} g_i(\|\tau_i\|).$$

Any optimal set of linear functionals

$$\arg\min_{\{\tau_i\}} h(\tau_1, ..., \tau_d, \phi_1, ..., \phi_n)$$
$$\text{s.t. } \forall i \;\; \tau_i : \mathcal{X} \to \mathbb{R} \text{ is a bounded linear functional}$$

must admit the representation of $\tau_i = \sum_{j=1}^{n} u_{ij} \tilde\psi_j$ $(i = 1, ..., d)$.

The proof of this result is very similar to that of (Scholkopf & Smola, 2001, Theorem
4.2) so we omit its details here. In fact, to prove the validity of KDNE, we need this
strong representer theorem (see the case of KPCA in Scholkopf and Smola (2001, p. 92)).

To apply Theorem 3 to our framework, we can simply view $A = (\tau_1, ..., \tau_d)^T$. If each $g_i$
is the square function, then the regularizer becomes $\sum_{i=1}^{d} \|\tau_i\|^2 = \|A\|_{HS}^2$ where $\|\cdot\|_{HS}$ is the
Hilbert-Schmidt (HS) norm of an operator. If each $\tau_i$ is finite-dimensional, the HS norm
is reduced to the Frobenius norm $\|\cdot\|_F$. Here, we allow the HS norm of a bounded linear
operator to take a value of $\infty$. For the kernel trick (by substituting $\tilde\psi_i = \phi_i$), the result
above states that any optimal $\{\tau_i\}$ must be represented by $\{\Phi u_i\}$. Therefore, using the
same notation as Subsection 3.2, we have

$$\sum_{i=1}^{d} g_i(\|\tau_i\|) = \sum_{i=1}^{d} \|\tau_i\|^2 = \sum_{i=1}^{d} u_i^T \Phi^T \Phi u_i = \sum_{i=1}^{d} u_i^T K u_i = \mathrm{trace}(U K U^T).$$

This regularizer first appeared in the work of Globerson and Roweis (2006). Similarly, for
the KPCA trick (by substituting $\tilde\psi_i = \psi_i$), any optimal $\{\tau_i\}$ must be represented by $\{\Psi u_i\}$
and, using the fact that $\Psi^T \Psi = I$, we have $\sum_{i=1}^{d} \|\tau_i\|^2 = \mathrm{trace}(U U^T) = \|U\|_F^2$.

By adding the regularizer, $\mathrm{trace}(U K U^T)$ or $\|U\|_F^2$, into existing objective functions,
we have a new class of learners, namely, regularized Mahalanobis distance learners such
as regularized KNCA (RKNCA), regularized KLMNN (RKLMNN) and regularized KDNE
(RKDNE). Our framework can be further extended to semi-supervised
settings by adding more complicated functions $g_i(\cdot)$ such as manifold regularizers, see e.g.
Chatpatanasiri and Kijsirikul (2008). We plan to investigate the effects of using various types
of regularizers in the near future.
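The two regularizers are inexpensive to compute once U and the Gram matrix K are available. The plain-Python sketch below is illustrative only (hypothetical function names, toy data); it treats the rows of U as the vectors u_i and, for a linear kernel, checks the kernel-trick regularizer against the squared Frobenius norm of the explicit map A = U X.

```python
def regularizer_kernel_trick(U, K):
    """trace(U K U^T) = sum_i u_i^T K u_i, where u_i is the i-th row of U."""
    total = 0.0
    for u in U:
        Ku = [sum(k * v for k, v in zip(row, u)) for row in K]
        total += sum(a * b for a, b in zip(u, Ku))
    return total


def regularizer_kpca_trick(U):
    """trace(U U^T), i.e. the squared Frobenius norm of U."""
    return sum(v * v for row in U for v in row)


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                       # toy inputs (rows)
K = [[sum(a * b for a, b in zip(x, z)) for z in X] for x in X]  # linear-kernel Gram matrix
U = [[1.0, 0.0, 2.0]]                                          # hypothetical 1 x 3 matrix

# With a linear kernel, A = U X = [[3, 2]], whose squared Frobenius norm
# equals the kernel-trick regularizer.
print(regularizer_kernel_trick(U, K), regularizer_kpca_trick(U))
```

Either quantity would simply be added, with a weighting constant chosen by the user, to the objective of the learner being regularized.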
4. Selection of a Kernel Function

The problem of selecting an efficient kernel function is central to all kernel machines. All
previous works on Mahalanobis distance learners use exhaustive methods such as cross
validation to select a kernel function. In this section, we investigate the possibility of
automatically constructing a kernel which is appropriate for a Mahalanobis distance learner. In the
first part of this section, we consider a popular method called kernel alignment (Lanckriet
et al., 2004; Zhu et al., 2005) which is able to learn, from a training set, a kernel in the
form of $k(\cdot, \cdot) = \sum_i \alpha_i k_i(\cdot, \cdot)$ where $k_1(\cdot, \cdot), ..., k_m(\cdot, \cdot)$ are pre-chosen base kernels. In the
second part of this section, we investigate a simple method which constructs an unweighted
combination of base kernels, $\sum_i k_i(\cdot, \cdot)$ (henceforth referred to as an unweighted kernel). A
theoretical result is provided to support this simple approach. Kernel constructions based
on our two approaches require much shorter running time than the standard
cross validation approach.



4.1 Kernel Alignment 

Here, our kernel alignment formulation belongs to the class of quadratic programs (QPs)
which can be solved more efficiently than the formulations proposed by Lanckriet et al.
(2004) and Zhu et al. (2005) which belong to the class of semidefinite programs (SDPs)
and quadratically constrained quadratic programs (QCQPs), respectively (Boyd &
Vandenberghe, 2004).

To use kernel alignment in classification problems, the following assumption is central:
for each pair of examples $x_i, x_j$, the ideal kernel $k(x_i, x_j)$ is $Y_{ij}$ (Guermeur
et al., 2004) where

$$Y_{ij} = \begin{cases} +1, & \text{if } y_i = y_j, \\ -\frac{1}{p-1}, & \text{otherwise,} \end{cases}$$

and $p$ is the number of classes in the training data. Denoting $Y$ as the matrix having
elements $Y_{ij}$, we then define the alignment between the kernel matrix $K$ and the ideal
kernel matrix $Y$ as follows:

$$\mathrm{align}(K, Y) = \frac{\langle K, Y \rangle_F}{\|K\|_F \, \|Y\|_F}, \qquad (10)$$

where $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner-product such that
$\langle K, Y \rangle_F = \mathrm{trace}(K^T Y)$ and $\|\cdot\|_F$ is the Frobenius norm
induced by the Frobenius inner-product.
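As an illustration of Eq. (10), the following is a minimal sketch in plain Python. It assumes the ideal kernel takes the value $-1/(p-1)$ for examples of different classes; the helper names (`ideal_kernel`, `frobenius_inner`, `alignment`, `rbf_gram`) and the toy data are hypothetical and not part of the original work.

```python
import math

def ideal_kernel(labels):
    """Ideal kernel matrix Y: +1 for same-class pairs, -1/(p-1) otherwise.

    Assumes at least two classes are present (p >= 2).
    """
    p = len(set(labels))  # number of classes
    off = -1.0 / (p - 1)
    return [[1.0 if a == b else off for b in labels] for a in labels]

def frobenius_inner(A, B):
    """<A, B>_F = trace(A^T B), i.e. the sum of elementwise products."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def alignment(K, Y):
    """align(K, Y) = <K, Y>_F / (||K||_F ||Y||_F), as in Eq. (10)."""
    return frobenius_inner(K, Y) / math.sqrt(
        frobenius_inner(K, K) * frobenius_inner(Y, Y))

def rbf_gram(X, sigma):
    """Gram matrix of a plain Gaussian kernel (a base kernel chosen for this demo)."""
    def k(x, y):
        d2 = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-d2 / (2.0 * sigma ** 2))
    return [[k(x, y) for y in X] for x in X]

# Toy data: two well separated 1-D classes.
X = [[0.0], [0.1], [1.0], [1.1]]
labels = [0, 0, 1, 1]
Y = ideal_kernel(labels)
for sigma in (0.05, 0.5, 5.0):
    print(sigma, alignment(rbf_gram(X, sigma), Y))
```

On this toy set, a very wide kernel makes all examples look alike and therefore aligns poorly with $Y$, which is the behavior the alignment score is meant to capture.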

Assume that we have $m$ kernel functions, $k_1(\cdot,\cdot), ..., k_m(\cdot,\cdot)$, and
$K_1, ..., K_m$ are their corresponding Gram matrices with respect to the training data. In
this paper, the kernel function obtained from the alignment method is parameterized in the
form of $k(\cdot,\cdot) = \sum_i \alpha_i k_i(\cdot,\cdot)$ where $\alpha_i \ge 0$. Note that
the obtained kernel function is guaranteed to be positive semidefinite. In order to learn
the best coefficients $\alpha_1, ..., \alpha_m$, we solve the following optimization problem:

$$\{\alpha_1, ..., \alpha_m\} = \arg\max_{\alpha_i \ge 0} \mathrm{align}(K, Y), \qquad (11)$$

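where $K = \sum_i \alpha_i K_i$. Note that as $K$ and $Y$ are PSD, $\langle K, Y \rangle_F \ge 0$.
Since both the numerator and denominator terms in the alignment equation can be arbitrarily
large, we can simply fix the numerator to 1. We then reformulate the problem as follows:

$$\arg\max_{\alpha_i \ge 0,\ \langle K, Y \rangle_F = 1} \mathrm{align}(K, Y)
= \arg\min_{\alpha_i \ge 0,\ \langle K, Y \rangle_F = 1} \|K\|_F \, \|Y\|_F$$
$$= \arg\min_{\alpha_i \ge 0,\ \langle K, Y \rangle_F = 1} \|K\|_F^2$$
$$= \arg\min_{\alpha_i \ge 0,\ \sum_i \alpha_i \langle K_i, Y \rangle_F = 1} \sum_{i,j} \alpha_i \alpha_j \langle K_i, K_j \rangle_F.$$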

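Defining a PSD matrix $S$ whose elements are $S_{ij} = \langle K_i, K_j \rangle_F$, a vector
$b = (\langle K_1, Y \rangle_F, ..., \langle K_m, Y \rangle_F)^T$ and a vector
$\alpha = (\alpha_1, ..., \alpha_m)^T$, we then reformulate Eq. (11) as follows:

$$\alpha = \arg\min_{\alpha_i \ge 0,\ \alpha^T b = 1} \alpha^T S \alpha. \qquad (12)$$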

This optimization problem is a QP and can be efficiently solved (Boyd & Vandenberghe,
2004); hence, we are able to learn the best kernel function
$k(\cdot,\cdot) = \sum_i \alpha_i k_i(\cdot,\cdot)$ efficiently.
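To make the QP of Eq. (12) concrete, here is a minimal sketch that builds $S$ and $b$ from given Gram matrices and solves the problem with a simple Frank-Wolfe iteration. The solver is a stand-in chosen to keep the example dependency-free, not the method used in the paper (any off-the-shelf QP solver could be substituted), and it additionally assumes every $b_i > 0$. All function names and the toy matrices are hypothetical.

```python
def frobenius_inner(A, B):
    """<A, B>_F as the sum of elementwise products."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def build_qp(grams, Y):
    """S_ij = <K_i, K_j>_F and b_i = <K_i, Y>_F, as defined for Eq. (12)."""
    S = [[frobenius_inner(Ki, Kj) for Kj in grams] for Ki in grams]
    b = [frobenius_inner(Ki, Y) for Ki in grams]
    return S, b

def solve_alignment_qp(S, b, iters=2000, tol=1e-12):
    """Minimize a^T S a subject to a >= 0 and a^T b = 1 (assumes all b_i > 0).

    Substituting beta_i = a_i * b_i turns the feasible set into the
    probability simplex, where Frank-Wolfe with exact line search applies.
    """
    m = len(b)
    Q = [[S[i][j] / (b[i] * b[j]) for j in range(m)] for i in range(m)]
    start = min(range(m), key=lambda i: Q[i][i])  # best single base kernel
    beta = [1.0 if i == start else 0.0 for i in range(m)]
    for _ in range(iters):
        grad = [2.0 * sum(Q[i][j] * beta[j] for j in range(m)) for i in range(m)]
        j = min(range(m), key=lambda i: grad[i])  # best simplex vertex
        d = [(1.0 if i == j else 0.0) - beta[i] for i in range(m)]
        slope = sum(g * di for g, di in zip(grad, d))
        if slope > -tol:  # no descent direction left
            break
        curv = sum(d[i] * Q[i][k] * d[k] for i in range(m) for k in range(m))
        t = 1.0 if curv <= 0 else min(1.0, -slope / (2.0 * curv))
        beta = [bi + t * di for bi, di in zip(beta, d)]
    return [beta[i] / b[i] for i in range(m)]

# Toy example: two classes, four points; K1 equals the ideal kernel itself.
Y = [[1.0, 1.0, -1.0, -1.0], [1.0, 1.0, -1.0, -1.0],
     [-1.0, -1.0, 1.0, 1.0], [-1.0, -1.0, 1.0, 1.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
ones_plus_identity = [[1.0 + identity[i][j] for j in range(4)] for i in range(4)]
S, b = build_qp([Y, identity, ones_plus_identity], Y)
alpha = solve_alignment_qp(S, b)
print(alpha)
```

In this toy case the first base kernel coincides with the ideal kernel, so all the weight is placed on it and the other coefficients stay at zero, which also illustrates why aligned kernels are often sparse.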

Since the magnitudes of the optimal $\alpha_i$ are varied due to $\|K_i\|_F$, it is
convenient to use $k'_i(\cdot,\cdot) = k_i(\cdot,\cdot)/\|K_i\|_F$ and hence
$K'_i = K_i/\|K_i\|_F$ in the derivation of Eq. (12). We define $S'$ and $b'$ similar to $S$
and $b$ except that they are based on $K'_i$ instead of $K_i$. Let

$$\gamma = \arg\min_{\gamma_i \ge 0,\ \gamma^T b' = 1} \gamma^T S' \gamma. \qquad (13)$$

It is easy to see that the final kernel function
$k(\cdot,\cdot) = \sum_i \gamma_i k'_i(\cdot,\cdot)$ achieved from Eq. (13) is not changed
from the kernel achieved from Eq. (12).
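The equivalence of the two problems can be seen through a change of variables. The short derivation below is a sketch that uses only the definitions of $S$, $b$ and the normalized base kernels given above; the substitution $\alpha_i = \gamma_i / \|K_i\|_F$ is the only step added here.

```latex
S'_{ij} = \frac{S_{ij}}{\|K_i\|_F \, \|K_j\|_F}, \qquad
b'_i = \frac{b_i}{\|K_i\|_F}, \qquad
\alpha_i := \frac{\gamma_i}{\|K_i\|_F}
\;\Longrightarrow\;
\gamma^T S' \gamma = \alpha^T S \alpha, \quad
\gamma^T b' = \alpha^T b, \quad
\gamma_i \ge 0 \iff \alpha_i \ge 0,
\qquad\text{and}\qquad
\sum_i \gamma_i k'_i(\cdot,\cdot) = \sum_i \alpha_i k_i(\cdot,\cdot).
```

Hence a minimizer of one problem maps to a minimizer of the other, and both yield the same combined kernel.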

Note that we can further modify Eq. (12) to enforce sparseness of $\alpha$ and improve the
speed of the algorithm by minimizing an upper bound of $\|K\|_F$ instead of minimizing the
exact quantity, so that the optimization formula belongs to the class of linear programs
(LPs) instead of QPs:

$$\min_{\alpha_i \ge 0,\ \langle K, Y \rangle_F = 1} \|K\|_F \;\le\; \min_{\alpha_i \ge 0,\ \langle K, Y \rangle_F = 1} \|\mathrm{vec}(K)\|_1 \qquad (14)$$

where $\mathrm{vec}(\cdot)$ denotes the standard "vec" operator converting a matrix to a
vector (Minka, 1997). By using a standard trick for an absolute-valued objective function
(Boyd & Vandenberghe, 2004), Eq. (14) can be solved by linear programming. Note that the
above approach of minimizing an upper bound of a desired objective function is similar to
the popular support vector machines, where the hinge loss is minimized instead of the 0/1
loss.
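For concreteness, one common way to write the right-hand side of Eq. (14) as a linear program is sketched below. The auxiliary variables $t_{pq}$ (one per matrix entry) are introduced here for illustration and do not appear in the original text.

```latex
\begin{aligned}
\min_{\alpha,\, t} \quad & \sum_{p,q} t_{pq} \\
\text{s.t.} \quad & -t_{pq} \le \sum_i \alpha_i (K_i)_{pq} \le t_{pq} \quad \text{for all } p, q, \\
& \alpha_i \ge 0 \quad \text{for all } i, \qquad \sum_i \alpha_i \langle K_i, Y \rangle_F = 1 .
\end{aligned}
```

At an optimum each $t_{pq}$ equals $|K_{pq}|$, so the objective coincides with $\|\mathrm{vec}(K)\|_1$. When all base kernels have non-negative entries (as RBF kernels do), the auxiliary variables are unnecessary because the objective is already linear in $\alpha$.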

4.2 Unweighted Kernels 

In this subsection, we show that a very simple kernel $k'(\cdot,\cdot) = \sum_i k_i(\cdot,\cdot)$
is theoretically efficient, no less than a kernel obtained from the alignment method. Denote
$\phi_i^k$ as the mapped vector of an original example $x_i$ by a map associated with a
kernel $k(\cdot,\cdot)$. The main idea of the contents presented in this section is the
following simple but useful result.

Proposition 1. Let $\{\alpha_i\}$ be a set of positive coefficients, $\alpha_i > 0$ for each
$i$, and let $k_1(\cdot,\cdot), ..., k_m(\cdot,\cdot)$ be base PSD kernels and
$k(\cdot,\cdot) = \sum_i \alpha_i k_i(\cdot,\cdot)$ and $k'(\cdot,\cdot) = \sum_i k_i(\cdot,\cdot)$.
Then, there exists an invertible linear map $B$ such that $B : \phi_i^{k'} \mapsto \phi_i^{k}$
for each $i$.
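Proof. Without loss of generality, we consider here only the case of $m = 2$; the cases
where $m > 2$ can be proven by induction. Let $\mathcal{H}_1 \oplus \mathcal{H}_2$ be the
direct sum of $\mathcal{H}_1$ and $\mathcal{H}_2$, where its inner product is defined by
$\langle \cdot, \cdot \rangle_{\mathcal{H}_1} + \langle \cdot, \cdot \rangle_{\mathcal{H}_2}$,
and let $\{\phi_i^{(j)}\} \subset \mathcal{H}_j$ denote the mapped training set associated
with the $j$-th base kernel. Then we can view
$\phi_i^{k} = (\sqrt{\alpha_1}\,\phi_i^{(1)}, \sqrt{\alpha_2}\,\phi_i^{(2)}) \in \mathcal{H}_1 \oplus \mathcal{H}_2$
since

$$\langle \phi_i^{k}, \phi_{i'}^{k} \rangle = \alpha_1 \langle \phi_i^{(1)}, \phi_{i'}^{(1)} \rangle + \alpha_2 \langle \phi_i^{(2)}, \phi_{i'}^{(2)} \rangle = \alpha_1 k_1(x_i, x_{i'}) + \alpha_2 k_2(x_i, x_{i'}) = k(x_i, x_{i'}).$$

Similarly, we can also view
$\phi_i^{k'} = (\phi_i^{(1)}, \phi_i^{(2)}) \in \mathcal{H}_1 \oplus \mathcal{H}_2$. Let $I_j$
be the identity map in $\mathcal{H}_j$. Then,

$$B = \begin{bmatrix} \sqrt{\alpha_1}\, I_1 & 0 \\ 0 & \sqrt{\alpha_2}\, I_2 \end{bmatrix}.$$

Since $\infty > \alpha_1, \alpha_2 > 0$ and $B$ is bounded (the operator norm of $B$ is
$\max(\sqrt{\alpha_1}, \sqrt{\alpha_2})$), $B$ is invertible. □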
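Proposition 1 can be checked numerically when the base kernels have explicit finite-dimensional feature maps. The sketch below assumes two such kernels on 2-D inputs, the linear kernel and the homogeneous second-order polynomial kernel, and arbitrary positive weights; these choices and all names are illustrative only.

```python
import math

def phi1(x):
    """Explicit feature map of the linear kernel k1(x, y) = x . y."""
    return list(x)

def phi2(x):
    """Explicit feature map of k2(x, y) = (x . y)^2 for 2-D inputs."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

alpha = (0.3, 2.0)  # arbitrary positive coefficients (hypothetical values)

def phi_unweighted(x):
    """Feature vector for k' = k1 + k2 in the direct sum space."""
    return phi1(x) + phi2(x)

def phi_weighted(x):
    """Feature vector for k = alpha1*k1 + alpha2*k2 in the direct sum space."""
    return ([math.sqrt(alpha[0]) * v for v in phi1(x)] +
            [math.sqrt(alpha[1]) * v for v in phi2(x)])

# B is block diagonal: sqrt(alpha1) on the first block, sqrt(alpha2) on the second.
B_diag = [math.sqrt(alpha[0])] * 2 + [math.sqrt(alpha[1])] * 3

def apply_B(v):
    return [s * c for s, c in zip(B_diag, v)]

def apply_B_inverse(v):
    return [c / s for s, c in zip(B_diag, v)]

x, y = [1.0, -2.0], [0.5, 3.0]
print(apply_B(phi_unweighted(x)))
print(phi_weighted(x))
print(dot(phi_weighted(x), phi_weighted(y)),
      alpha[0] * dot(x, y) + alpha[1] * dot(x, y) ** 2)
```

The two printed feature vectors agree, and the inner product of the weighted features reproduces $\alpha_1 k_1 + \alpha_2 k_2$, which is the content of the proposition in this finite-dimensional setting.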

Now suppose we apply the kernel $k(\cdot,\cdot) = \sum_i \alpha_i k_i(\cdot,\cdot)$ obtained
from the kernel alignment method to a Mahalanobis distance learner and an optimal
transformation $A^*$ is returned. Let $f(\cdot)$ be an objective function which depends only
on the inner products $\langle A\phi_i, A\phi_j \rangle$ (as assumed in Theorems 1 and 2).
Since, from Proposition 1,
$\langle A^*\phi_i^{k}, A^*\phi_j^{k} \rangle = \langle A^*B\phi_i^{k'}, A^*B\phi_j^{k'} \rangle$,
we have

$$f^* \equiv f\left(\{\langle A^*\phi_i^{k}, A^*\phi_j^{k} \rangle\}\right) = f\left(\{\langle A^*B\phi_i^{k'}, A^*B\phi_j^{k'} \rangle\}\right).$$

Thus, by applying the training set $\{\phi_i^{k'}\}$ to a learner which tries to minimize
$f(\cdot)$, the learner will return a linear map with an objective value less than or equal
to $f^*$ (because the learner can at least return $A^*B$). Notice that because $B$ is
invertible, the value $f^*$ is in fact optimal. Consequently, the following claim can be
stated: "there is no need to apply the methods which learn $\{\alpha_i\}$, e.g. the kernel
alignment method, at least in theory, because learning with a simple kernel
$k'(\cdot,\cdot)$ also results in a linear map having the same optimal objective value".
However, in practice, there can be some differences between using the two kernels
$k(\cdot,\cdot)$ and $k'(\cdot,\cdot)$ due to the following reasons.

• Existence of a local solution. As some optimization problems are not convex, there
is no guarantee that a solver is able to discover a global solution within a reasonable time.
Usually, a learner discovers only a local solution, and hence two learners based on
$k(\cdot,\cdot)$ and $k'(\cdot,\cdot)$ will not give the same solution. KNCA belongs to this
case.

• Non-existence of the unique global solution. In some optimization problems,
there can be many different linear maps having the same optimal value $f^*$, and hence there
is no guarantee that two learners based on $k(\cdot,\cdot)$ and $k'(\cdot,\cdot)$ will give
the same solution. KLMNN is an example of this case.

• Size constraints. Because of a size constraint such as $AA^T = I$ used in KDNE, our
arguments above cannot be applied, i.e., given that $A^* A^{*T} = I$, there is no guarantee
that $(A^*B)(A^*B)^T = I$. Hence, $A^*B$ may not be an optimal solution of a learner based on
$k'(\cdot,\cdot)$.

• Preprocessing of target neighbors. The behavior of some learners depends on
their preprocessing. For example, before learning takes place, the KLMNN and KDNE
algorithms have to specify the target neighbors of each point (by specifying a value of
$w_{ij}$). In the case of using the KPCA trick, this specification is based on the Euclidean
distance with respect to a selected kernel (see Subsection 5.1.2 and Proposition 3). In this
case, the Euclidean distance with respect to an aligned kernel $k(\cdot,\cdot)$ (which has
already used some information of the training set) is more appropriate than the Euclidean
distance with respect to an unweighted kernel $k'(\cdot,\cdot)$.

• Zero coefficients. In the above proposition we assume $\alpha_i > 0$ for all $i$. Often,
the alignment algorithm returns $\alpha_i = 0$ for some $i$. Define $A^*$ and $f^*$ as above.
Following the same line of the proof of Proposition 1, in the cases where the alignment
method gives $\alpha_i = 0$ for some $i$, it can be easily shown that a learner with the
kernel $k'(\cdot,\cdot)$ will return a linear map with an objective value better than or
equal to $f^*$.

Since constructing $k'(\cdot,\cdot)$ is extremely easy, $k'(\cdot,\cdot)$ is a very
attractive choice to be used in kernelized algorithms.



5. Demonstrations 

In this section, the advantages of the KPCA trick over the kernel trick are demonstrated. 
After that, we conduct extensive experiments to illustrate the performance of kernelized 
algorithms, especially for those applying the kernel construction methods described in the 
previous section. 



5.1 KPCA Trick versus Kernel Trick 

To understand the advantages of the KPCA trick over the kernel trick, it is best to derive a
kernel trick formula for each algorithm and see what has to be done in order to implement
a kernelized algorithm applying the kernel trick. In this section, we define $\phi_i$ and
$\Phi$ as in Section 3.2.



5.1.1 KNCA 

As noted in Sect. 2.1, in order to minimize the objective of NCA and KNCA, we need to
derive gradient formulas, and the formula of $\partial f^{KNCA}/\partial A$ is (Goldberger
et al., 2005):

$$\frac{\partial f^{KNCA}}{\partial A} = -2A \sum_i \Big( p_i \sum_k p_{ik}\, \phi_{ik}\phi_{ik}^T - \sum_{j \in c_i} p_{ij}\, \phi_{ij}\phi_{ij}^T \Big), \qquad (15)$$

where for brevity we denote $\phi_{ij} = \phi_i - \phi_j$. Nevertheless, since $\phi_i$ may
lie in an infinite dimensional space, the above formula cannot always be implemented in
practice. In order to implement the kernel-trick version of KNCA, users need to prove the
following proposition, which is not stated in the original work of Goldberger et al. (2005):

Proposition 2. $\partial f^{KNCA}/\partial A$ can be formulated as $V\Phi^T$ where $V$
depends on $\{\phi_i\}$ only in the form of $\langle \phi_i, \phi_j \rangle = k(x_i, x_j)$,
and thus we can compute all elements of $V$.
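Proof. Define a matrix $B_i^{\phi} = (0, 0, ..., \phi, ..., 0, 0)$ as a matrix whose $i$-th
column is $\phi$ and whose other columns are zero vectors. Denote $k_{ij} = k_i - k_j$.
Substituting $A = U\Phi^T$ into Eq. (15), we have

$$\frac{\partial f^{KNCA}}{\partial A}
= -2U\Phi^T \sum_i \Big( p_i \sum_k p_{ik}\, \phi_{ik}\phi_{ik}^T - \sum_{j \in c_i} p_{ij}\, \phi_{ij}\phi_{ij}^T \Big)$$
$$= -2U \sum_i \Big( p_i \sum_k p_{ik}\, k_{ik}\phi_{ik}^T - \sum_{j \in c_i} p_{ij}\, k_{ij}\phi_{ij}^T \Big)$$
$$= -2U \sum_i \Big( p_i \sum_k p_{ik} \big( B_i^{k_{ik}} - B_k^{k_{ik}} \big) - \sum_{j \in c_i} p_{ij} \big( B_i^{k_{ij}} - B_j^{k_{ij}} \big) \Big) \Phi^T = V\Phi^T,$$

which completes the proof. □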
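The key step behind the kernel-trick formulas is that, with $A = U\Phi^T$, the image of a feature difference can be computed from kernel values alone: $A\phi_{ij} = U k_{ij}$. The following minimal sketch checks this identity on a small explicit feature matrix. It assumes that $k_i$ denotes the $i$-th column of the Gram matrix $K = \Phi^T\Phi$; the numbers and helper names are hypothetical.

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def column(M, i):
    return [row[i] for row in M]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Columns of Phi are toy explicit feature vectors phi_1..phi_4 (d = 3, n = 4).
Phi = [[1.0, 0.0, 2.0, -1.0],
       [0.5, 1.0, 0.0, 3.0],
       [2.0, -1.0, 1.0, 0.0]]
# U is an arbitrary r x n coefficient matrix (r = 2).
U = [[0.2, -0.1, 0.0, 0.4],
     [1.0, 0.3, -0.5, 0.0]]

K = matmul(transpose(Phi), Phi)  # Gram matrix, K_ij = <phi_i, phi_j>
A = matmul(U, transpose(Phi))    # A = U Phi^T, an r x d linear map

def diff_via_features(i, j):
    """A (phi_i - phi_j), which needs the explicit feature vectors."""
    phi_ij = [a - b for a, b in zip(column(Phi, i), column(Phi, j))]
    return matvec(A, phi_ij)

def diff_via_kernel(i, j):
    """U (k_i - k_j), which needs only kernel values."""
    k_ij = [a - b for a, b in zip(column(K, i), column(K, j))]
    return matvec(U, k_ij)

print(diff_via_features(0, 3))
print(diff_via_kernel(0, 3))
```

Both routes give the same vector, so quantities such as $\|A\phi_i - A\phi_j\|$ that enter $p_{ij}$ can be evaluated without ever forming $\phi_i$.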

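Therefore, at the $i$-th iteration of an optimization step of a gradient optimizer, we need
to update the current best linear map as follows:

$$A^{(i)} = A^{(i-1)} - \epsilon \frac{\partial f^{KNCA}}{\partial A} = \big( U^{(i-1)} - \epsilon V^{(i-1)} \big) \Phi^T = U^{(i)} \Phi^T, \qquad (16)$$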

where $\epsilon$ is a step size. The kernel-trick formulas of KNCA are thus finally achieved.
However, we emphasize that the process of proving Proposition 2 and Eq. (16) is not trivial
and may be tedious and difficult for non-experts as well as practitioners who focus on
applications rather than theories. Moreover, since the formula of
$\partial f^{KNCA}/\partial A$ is significantly different from $\partial f^{NCA}/\partial A$,
users are required to re-implement KNCA (even if they already possess an NCA
implementation), which is again not at all convenient. In contrast, we note that all these
difficulties disappear if the KPCA trick algorithm consisting of three simple steps shown
in Fig. 1 is applied instead of the kernel trick.

There is another advantage of using the KPCA trick on KNCA^3. By the nature of a
gradient optimizer, it takes a large amount of time for NCA and KNCA to converge to a
local solution, and thus a method of speeding up the algorithms is needed. As recommended
by Weinberger et al. (2006), it can be helpful to first apply PCA before performing a
learning process: the learning process can be tremendously sped up by retaining only, say,
the 100 largest-variance principal components of the input data. In the KPCA trick
framework, no extra work is required for this speed-up task as KPCA is already applied in
the first place.



5.1.2 KLMNN 

Similar to KNCA, the online-available code of LMNN^4 employs a gradient-based
optimization, and thus new gradient formulas in the feature space have to be derived and a
new implementation has to be written in order to apply the kernel trick. On the other hand,
by applying the KPCA trick, the original LMNN code can be used immediately.

3. We slightly modify the code of Charless Fowlkes: http://www.cs.berkeley.edu/~fowlkes/software/nca/

4. http://www.weinbergerweb.net/Downloads/LMNN.html

There is another advantage of the KPCA trick on LMNN: LMNN requires a specification
of $w_{ij}$ which is usually based on the quantity $\|x_i - x_j\|$. Thus, it makes sense that
$w_{ij}$ should be based on
$\|\phi_i - \phi_j\| = \sqrt{k(x_i, x_i) + k(x_j, x_j) - 2k(x_i, x_j)}$ with respect to the
feature space of KLMNN, and hence, with the kernel trick, users have to modify the original
code in order to appropriately specify $w_{ij}$. In contrast, by applying the KPCA trick
which restricts $\{\psi_i\}$ to be an orthonormal set as in Eq. (9), we have the following
proposition:

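Proposition 3. Let $\{\psi_i\}_{i=1}^{n}$ be an orthonormal set such that
$\mathrm{span}(\{\psi_i\}_{i=1}^{n}) = \mathrm{span}(\{\phi_i\}_{i=1}^{n})$ and
$\varphi_i = (\langle \phi_i, \psi_1 \rangle, ..., \langle \phi_i, \psi_n \rangle)^T \in \mathbb{R}^n$;
then $\|\varphi_i - \varphi_j\|^2 = \|\phi_i - \phi_j\|^2$ for each $1 \le i, j \le n$.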

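Proof. Since we work on a separable Hilbert space $\mathcal{X}$, we can extend the
orthonormal set $\{\psi_i\}_{i=1}^{n}$ to $\{\psi_i\}_{i=1}^{\infty}$ such that
$\mathrm{span}(\{\psi_i\}_{i=1}^{\infty})$ is $\mathcal{X}$ and
$\langle \phi_i, \psi_j \rangle = 0$ for each $i = 1, ..., n$ and $j > n$. Then, by an
application of the Parseval identity (Lewkeeratiyutkul, 2006),

$$\|\phi_i - \phi_j\|^2 = \sum_{k=1}^{\infty} \langle \phi_i - \phi_j, \psi_k \rangle^2 = \sum_{k=1}^{n} \langle \phi_i - \phi_j, \psi_k \rangle^2 = \|\varphi_i - \varphi_j\|^2.$$

The last equality comes from Eq. (9). □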

Therefore, with the KPCA trick, the target neighbors $w_{ij}$ of each point are computed
based on $\|\varphi_i - \varphi_j\| = \|\phi_i - \phi_j\|$ without any modification of the
original code.
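Proposition 3 can be verified numerically from a Gram matrix alone. The sketch below does not run KPCA; as a simpler stand-in it uses a Cholesky factorization $K = LL^T$, whose rows are the coordinates of the mapped points in the Gram-Schmidt orthonormal basis of their span, which is all the proposition requires. It assumes a strictly positive definite Gram matrix (here a Gaussian kernel on distinct points); the data and function names are hypothetical.

```python
import math

def rbf(x, y, sigma=1.0):
    """Gaussian kernel, used here only to obtain a positive definite Gram matrix."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def cholesky(K):
    """Lower-triangular L with K = L L^T (assumes K is positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
K = [[rbf(x, y) for y in X] for x in X]
coords = cholesky(K)  # row i holds the coordinates of phi_i in an orthonormal basis

for i in range(len(X)):
    for j in range(len(X)):
        feature_space = K[i][i] + K[j][j] - 2.0 * K[i][j]  # ||phi_i - phi_j||^2
        projected = squared_distance(coords[i], coords[j])
        print(i, j, feature_space, projected)
```

The two printed columns coincide up to rounding, so a learner that selects target neighbors by Euclidean distance on the finite-dimensional coordinates is in effect using feature-space distances.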

5.1.3 KDNE 

By applying $A = U\Phi^T$ and defining the Gram matrix $K = \Phi^T\Phi$, we have the
following proposition.

Proposition 4. The kernel-trick formula of KDNE is the following minimization problem:

$$U^* = \arg\min_{UKU^T = I} \mathrm{trace}\big( UK(D - W)KU^T \big). \qquad (17)$$

Note that this kernel-trick formula of KDNE involves a generalized eigenvalue problem
instead of the plain eigenvalue problem involved in DNE. As a consequence, we face a
singularity problem, i.e. if $K$ is not full-rank, the constraint $UKU^T = I$ cannot be
satisfied. Using elementary linear algebra, it can be shown that $K$ is not full-rank if and
only if $\{\phi_i\}$ is not linearly independent, and this condition is not highly
improbable. Sugiyama (2006), Yu and Yang (2001), and Yang and Yang (2003) suggest methods to
cope with the singularity problem in the context of Fisher discriminant analysis which may
be applicable to KDNE. Sugiyama (2006) recommends using the constraint
$U(K + \epsilon I)U^T = I$ instead of the original constraint; however, an appropriate value
of $\epsilon$ has to be tuned by cross validation, which is time-consuming. Alternatively,
Yu and Yang (2001) and Yang and Yang (2003) propose more complicated methods of directly
minimizing an objective function in the null space of the constraint matrix so that the
singularity problem is explicitly avoided.

We note that a KPCA-trick implementation of KDNE does not have this singularity
problem as only a plain eigenvalue problem has to be solved. Moreover, as in KLMNN,
applying the KPCA trick instead of the kernel trick to KDNE avoids the tedious task of
modifying the original code to appropriately specify $w_{ij}$ in the feature space.
Figure 2: Two synthetic examples where NCA, LMNN and DNE cannot learn any efficient
Mahalanobis distances for kNN. Note that in each example, data in each class lie
on a simple non-linear 1-dimensional subspace (which, however, cannot be discovered
by the three learners). In contrast, the kernel versions of the three algorithms
(using the 2nd-order polynomial kernel) can learn very efficient distances, i.e., the
non-linear subspaces are discovered by the kernelized algorithms.



Table 1: The average accuracy with standard deviation of NCA and its kernel versions.
On the bottom row, the win/draw/lose statistics of each kernelized algorithm
compared to its original version are shown.

Name            NCA            KNCA           Aligned KNCA   Unweighted KNCA
Balance         0.89 ± 0.03    0.92 ± 0.01    0.92 ± 0.01    0.91 ± 0.03
Breast Cancer   0.95 ± 0.01    0.97 ± 0.01    0.96 ± 0.01    0.96 ± 0.02
Glass           0.61 ± 0.05    0.69 ± 0.02    0.69 ± 0.04    0.68 ± 0.04
Ionosphere      0.83 ± 0.04    0.94 ± 0.03    0.92 ± 0.02    0.90 ± 0.03
Iris            0.96 ± 0.03    0.96 ± 0.01    0.95 ± 0.03    0.96 ± 0.02
Musk2           0.87 ± 0.02    0.90 ± 0.01    0.88 ± 0.02    0.87 ± 0.02
Pima            0.68 ± 0.02    0.71 ± 0.02    0.67 ± 0.03    0.69 ± 0.01
Satellite       0.82 ± 0.02    0.84 ± 0.01    0.84 ± 0.01    0.82 ± 0.02
Yeast           0.47 ± 0.02    0.50 ± 0.01    0.49 ± 0.02    0.47 ± 0.02
Win/Draw/Lose   -              8/1/0          7/0/2          5/4/0



5.2 Numerical Experiments 

On page 8 of the LMNN paper (Weinberger et al., 2006), Weinberger et al. gave a com- 
ment about KLMNN: 'as LMNN already yields highly nonlinear decision boundaries in the 
original input space, however, it is not obvious that "kernelizing" the algorithm will lead 
to significant further improvement'. Here, before giving experimental results, we explain
why "kernelizing" the algorithm can lead to significant improvements. The main intuition
behind the kernelization of "Mahalanobis distance learners for the kNN classification
algorithm" lies in the fact that the non-linear boundaries produced by kNN (with or without
a Mahalanobis distance) are usually helpful for problems with multi-modalities; however,
the non-linear boundaries of kNN are sometimes not helpful when data of the same class lie
on a low-dimensional non-linear manifold, as shown in Figure 2.

In this section, we conduct experiments on NCA, LMNN, DNE and their kernel versions
on nine real-world datasets to show that (1) the kernelized algorithms usually outperform
their original versions on real-world datasets, and (2) the performance of the linearly
combined kernels obtained by the two methods presented in Section 4 is comparable to that
of kernels which are exhaustively selected, while the two methods require much shorter
running time.
Table 2: The average accuracy with standard deviation of LMNN and its kernel versions.

Name            LMNN           KLMNN          Aligned KLMNN  Unweighted KLMNN
Balance         0.84 ± 0.04    0.87 ± 0.01    0.88 ± 0.02    0.85 ± 0.01
Breast Cancer   0.95 ± 0.01    0.97 ± 0.01    0.97 ± 0.00    0.97 ± 0.00
Glass           0.63 ± 0.05    0.69 ± 0.04    0.69 ± 0.04    0.66 ± 0.05
Ionosphere      0.88 ± 0.02    0.95 ± 0.02    0.94 ± 0.02    0.94 ± 0.02
Iris            0.95 ± 0.02    0.96 ± 0.02    0.95 ± 0.02    0.97 ± 0.01
Musk2           0.80 ± 0.03    0.93 ± 0.01    0.88 ± 0.02    0.86 ± 0.02
Pima            0.68 ± 0.02    0.71 ± 0.02    0.72 ± 0.02    0.67 ± 0.03
Satellite       0.81 ± 0.01    0.85 ± 0.01    0.84 ± 0.01    0.83 ± 0.02
Yeast           0.47 ± 0.02    0.48 ± 0.02    0.54 ± 0.02    0.50 ± 0.02
Win/Draw/Lose   -              9/0/0          8/1/0          8/0/1



Table 3: The average accuracy with standard deviation of DNE and its kernel versions.

Name            DNE            KDNE           Aligned KDNE   Unweighted KDNE
Balance         0.79 ± 0.02    0.90 ± 0.01    0.83 ± 0.02    0.85 ± 0.03
Breast Cancer   0.96 ± 0.01    0.97 ± 0.01    0.96 ± 0.01    0.96 ± 0.02
Glass           0.65 ± 0.04    0.70 ± 0.03    0.69 ± 0.04    0.65 ± 0.03
Ionosphere      0.87 ± 0.02    0.95 ± 0.02    0.95 ± 0.02    0.93 ± 0.03
Iris            0.95 ± 0.02    0.97 ± 0.02    0.96 ± 0.02    0.96 ± 0.03
Musk2           0.89 ± 0.02    0.91 ± 0.01    0.89 ± 0.02    0.84 ± 0.03
Pima            0.67 ± 0.02    0.69 ± 0.02    0.70 ± 0.03    0.70 ± 0.02
Satellite       0.84 ± 0.01    0.85 ± 0.01    0.85 ± 0.01    0.81 ± 0.02
Yeast           0.40 ± 0.05    0.48 ± 0.01    0.47 ± 0.04    0.52 ± 0.02
Win/Draw/Lose   -              9/0/0          7/2/0          5/2/2



To measure the generalization performance of each algorithm, we use nine real-world
datasets obtained from the UCI repository (Asuncion & Newman, 2007): Balance,
Breast Cancer, Glass, Ionosphere, Iris, Musk2, Pima, Satellite and Yeast. Following
previous works, we randomly divide each dataset into training and testing sets. By
repeating the process 40 times, we have 40 training and testing sets for each dataset. The
generalization performance of each algorithm is then measured by the average test accuracy
over the 40 testing sets of each dataset. The number of training examples is 200, except for
Glass and Iris where we use 100 examples because these two datasets contain only 214
and 150 examples in total, respectively.

Following previous works, we use the 1NN classifier in all experiments. In order to
kernelize the algorithms, three approaches are applied to select appropriate kernels:

• Cross validation (KNCA, KLMNN and KDNE). 

• Kernel alignment (Aligned KNCA, Aligned KLMNN and Aligned KDNE). 

• Unweighted combination of base kernels (Unweighted KNCA, Unweighted KLMNN 
and Unweighted KDNE). 
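The evaluation protocol described above (repeated random splits, a 1NN classifier, and mean accuracy with standard deviation) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' experimental code: it applies plain Euclidean 1NN without any learned transformation, uses the sample standard deviation, and all names and the toy dataset are hypothetical.

```python
import random
import statistics

def one_nn_predict(train_X, train_y, x):
    """Label of the nearest training point under squared Euclidean distance."""
    nearest = min(range(len(train_X)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    return train_y[nearest]

def repeated_split_accuracy(X, y, n_train, n_repeats=40, seed=0):
    """Mean and sample standard deviation of 1NN test accuracy over random splits."""
    rng = random.Random(seed)
    indices = list(range(len(X)))
    accuracies = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct = sum(one_nn_predict(train_X, train_y, X[i]) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Toy dataset: two well separated clusters of ten points each.
X = [[0.1 * i, 0.0] for i in range(10)] + [[10.0 + 0.1 * i, 10.0] for i in range(10)]
y = [0] * 10 + [1] * 10
mean_acc, std_acc = repeated_split_accuracy(X, y, n_train=10)
print(mean_acc, std_acc)
```

In an actual experiment, a learned linear map (or its KPCA-trick counterpart) would be applied to the data before calling `one_nn_predict`.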

For all three methods, we consider scaled RBF base kernels (Schölkopf & Smola, 2001,
p. 216), $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2D\sigma^2}\right)$, where $D$ is the
dimensionality of the input data. Twenty-one base kernels specified by the following values
of $\sigma$ are considered: 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5,
10, 25, 50, 75, 100, 250, 500, 750, 1000. All kernelized algorithms are implemented by the
KPCA trick illustrated in Figure 1. As noted in Subsection 4.2, the main problem of applying
the unweighted kernel to algorithms such as KLMNN and KDNE is that the Euclidean distance
with respect to the unweighted kernel is not informative and thus should not be used to
specify target neighbors of each point. Therefore, in the cases of KLMNN and KDNE which
apply the unweighted kernel, we employ the Euclidean distance with respect to the input
space to specify target neighbors. We slightly modify the original codes of LMNN and DNE to
fulfill this desired specification.

Figure 3: This figure illustrates the performance of Unweighted KDNE with different
numbers of base kernels. It can be observed from the figure that the generalization
performance of Unweighted KDNE eventually becomes stable as we add more
and more base kernels.
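The base kernels and their unweighted combination described above can be sketched as follows. The code assumes the scaled RBF form $\exp(-\|x - y\|^2 / (2D\sigma^2))$ with $D$ the input dimensionality; the function names and the toy inputs are hypothetical.

```python
import math

# The twenty-one kernel widths listed in the text.
SIGMAS = [0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5,
          10, 25, 50, 75, 100, 250, 500, 750, 1000]

def scaled_rbf_gram(X, sigma):
    """Gram matrix of k(x, y) = exp(-||x - y||^2 / (2 * D * sigma^2))."""
    D = len(X[0])  # dimensionality of the input data
    def k(x, y):
        d2 = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-d2 / (2.0 * D * sigma ** 2))
    return [[k(x, y) for y in X] for x in X]

def unweighted_gram(X, sigmas=SIGMAS):
    """Gram matrix of the unweighted kernel k' = sum_i k_i."""
    n = len(X)
    total = [[0.0] * n for _ in range(n)]
    for sigma in sigmas:
        G = scaled_rbf_gram(X, sigma)
        for i in range(n):
            for j in range(n):
                total[i][j] += G[i][j]
    return total

X = [[0.0, 1.0, 2.0], [0.5, 1.5, 2.5], [3.0, -1.0, 0.0]]
K_unweighted = unweighted_gram(X)
print(K_unweighted[0][0], K_unweighted[0][1], K_unweighted[0][2])
```

Because every RBF base kernel equals one on the diagonal, the diagonal of the unweighted Gram matrix equals the number of base kernels; the resulting matrix can then be passed to the KPCA step in place of a single-kernel Gram matrix.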

The experimental results are shown in Tables 1, 2 and 3. From the results, it is clear
that the kernelized algorithms usually improve the performance of their original algorithms.
The kernelized algorithms applying cross validation obtain the best performance. They
outperform the original methods in 26 out of 27 cases (nine datasets for each of the three
algorithms). The other two kernel versions of the three original algorithms also have
satisfactory performance. The kernelized algorithms applying kernel alignment outperform
the original algorithms in 22 cases and obtain an equal performance in 3 cases. In only
2 out of 27 cases do the original algorithms outperform the kernelized algorithms applying
kernel alignment. Similarly, the kernelized algorithms applying the unweighted kernel
outperform the original algorithms in 18 cases and obtain an equal performance in 6 cases.
In only 3 out of 27 cases do the original algorithms outperform the kernelized algorithms
applying the unweighted kernel.

We note that although the cross validation method usually gives the best performance,
the other two kernel construction methods provide comparable results in much shorter
running time. For each dataset, the run-time overhead of the kernelized algorithms applying
cross validation is several hours (on a Pentium IV 1.5 GHz with 1 GB of RAM), while the
run-time overheads of the kernelized algorithms applying aligned kernels and the unweighted
kernel are on the order of minutes and seconds, respectively. Therefore, in time-limited
circumstances, it is attractive to apply an aligned kernel or an unweighted kernel.

Note that the kernel alignment method is not appropriate for a multi-modal dataset
in which there may be several clusters of data points for each class since, from Eq. (10),
the function align(K, Y) will attain the maximum value if and only if all points of the same
class are collapsed into a single point. This may be one reason why cross validated
kernels give better results than aligned kernels in our experiments. Developing
a new kernel alignment algorithm which is suitable for multi-modality is currently an open
problem.

Comparing the generalization performance induced by aligned kernels and the unweighted
kernel, algorithms applying aligned kernels perform slightly better than algorithms applying
the unweighted kernel. With little overhead and satisfactory performance, however, the
unweighted kernel is still attractive for algorithms, like NCA (in contrast to LMNN and
DNE), which do not require a specification of target neighbors $w_{ij}$. Since the Euclidean
distance with respect to the unweighted kernel is usually not appropriate for specifying
$w_{ij}$, a KPCA-trick application of algorithms like LMNN and DNE may still require some
re-programming.

As noted in the previous section, aligned kernels usually do not use all base kernels
($\alpha_i = 0$ for some $i$); in contrast, the unweighted kernel uses all base kernels
($\alpha_i = 1$ for all $i$). Hence, as described in Section 4.2, the feature space
corresponding to the unweighted kernel usually contains the feature space corresponding to
aligned kernels. Therefore, we may informally say that the feature space induced by the
unweighted kernel is "larger" than the ones induced by aligned kernels.

Since a feature space which is too large can lead to overfitting, one may wonder whether
or not using the unweighted kernel leads to overfitting. Figure 3 shows that overfitting
indeed does not occur. For compactness, we show only the results of Unweighted KDNE.
In the experiments shown in this figure, base kernels are added in the following order of
$\sigma$: 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10, 25, 50, 75,
100, 250, 500, 750, 1000. It can be observed from the figure that the generalization
performance of Unweighted KDNE eventually becomes stable as we add more and more base
kernels. Also, it can be observed that 10 - 14 base kernels are enough to obtain stable
performance. It would be interesting to further investigate the overfitting behavior of a
learner by applying methods such as a bias-variance analysis (James, 2003), and to
investigate whether or not it is appropriate to apply an "adaptive resampling and
combining" method (Breiman, 1998) to improve the classification performance of a supervised
Mahalanobis distance learner.

6. Summary 

We have presented general frameworks to kernelize Mahalanobis distance learners. Three
recent algorithms are kernelized as examples. Although we have focused only on the
supervised setting, the frameworks are clearly applicable to learners in other settings as
well, e.g. a semi-supervised learner. Two representer theorems which justify both our
framework and those in previous works are formally proven. The theorems can also be applied
to Mahalanobis distance learners in unsupervised and semi-supervised settings. Moreover, we
present two methods which can be efficiently used for constructing a good kernel function
from training data. Although we have concentrated only on Mahalanobis distance learners,
our kernel construction methods can indeed be applied to all kernel classifiers. Numerical
results over various real-world datasets showed consistent improvements of kernelized
learners over their original versions.